Skip to content

Support raw HSACO custom kernargs#12

Open
AWoloszyn wants to merge 11 commits into
mainfrom
users/awoloszyn/amdgpu-raw-hsaco-custom-kernargs
Open

Support raw HSACO custom kernargs#12
AWoloszyn wants to merge 11 commits into
mainfrom
users/awoloszyn/amdgpu-raw-hsaco-custom-kernargs

Conversation

@AWoloszyn
Copy link
Copy Markdown
Collaborator

@AWoloszyn AWoloszyn commented May 28, 2026

Preserve native kernarg layouts for raw HSACO kernels so prepacked HIP argument buffers from libraries such as hipBLASLt can dispatch correctly through the AMDGPU HAL.

Preserve native kernarg layouts for raw HSACO kernels so prepacked HIP argument buffers from libraries such as hipBLASLt can dispatch correctly through the AMDGPU HAL.
@stellaraccident
Copy link
Copy Markdown
Collaborator

CI caught an error.

@AWoloszyn AWoloszyn marked this pull request as ready for review June 1, 2026 12:56
Comment thread runtime/src/iree/hal/executable.h Outdated
Comment thread runtime/src/iree/hal/drivers/amdgpu/executable.c Outdated
Comment thread runtime/src/iree/hal/drivers/amdgpu/executable.c Outdated
Comment thread runtime/src/iree/hal/drivers/amdgpu/device/dispatch.h Outdated
Comment thread runtime/src/iree/hal/drivers/amdgpu/device/dispatch.c Outdated
Comment thread runtime/src/iree/hal/drivers/amdgpu/abi/kernel_args.h Outdated
Comment thread runtime/src/iree/hal/drivers/amdgpu/util/hsaco_metadata.c Outdated
Comment thread runtime/src/iree/hal/drivers/amdgpu/util/hsaco_metadata.c Outdated
Comment thread runtime/src/iree/hal/drivers/amdgpu/aql_command_buffer.c Outdated
Comment thread runtime/src/iree/hal/drivers/amdgpu/aql_command_buffer.c Outdated
AWoloszyn and others added 8 commits June 1, 2026 13:36
Co-authored-by: Ben Vanik <ben.vanik@gmail.com>
Hoist raw HSACO kernel iteration into executable creation so the resolver only writes a single host kernel-args record.
Keep implicit-argument layout metadata in the dispatch kernarg layout instead of duplicating it in the hot kernel-args table.
@benvanik benvanik self-requested a review June 2, 2026 16:34
Copy link
Copy Markdown
Collaborator

@benvanik benvanik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second-pass review notes. I agree this is moving in a better direction overall, but I think there are still a few contract issues that need tightening before this is safe to merge. Most of them come from one pressure point: the raw/native kernarg model is starting to share helpers with the regular HAL/flatbuffer model, and those two projections need to stay separate.

1. Raw kernarg reflection leaks into flatbuffer executable loading

The new default HSACO reflection projection computes constant_count from the full aligned native kernarg segment:

That helper is still used by the flatbuffer executable path, which validates it against the flatbuffer ExportDef.constant_count:

For a normal wrapped executable with two 64-bit bindings and two 32-bit constants, the native kernarg segment is 24 bytes, so this projects constant_count == 6. The flatbuffer ABI still says there are 2 constants. That means the flatbuffer path can reject valid wrapped executables because the raw-HSACO projection is now pretending padding and binding bytes are HAL constants.

The same leak shows up in parameter offsets. The public HAL contract says binding parameter offsets are binding ordinals:

But populate_default_export_parameters now writes raw kernarg byte offsets for bindings:

Suggested fix: split this into two explicit projections instead of trying to make one "default" helper mean both things.

  • Flatbuffer/HAL projection: compact visible metadata into HAL ABI reflection, with binding offsets as binding ordinals and constants as compact byte offsets/counts. This is what the wrapped executable path validates against ExportDef.
  • Raw HSACO/native projection: preserve native kernarg offsets and expose enough byte coverage for custom-direct prepacked argument blobs. This should be used only by the raw HSACO executable path.

That split also gives the names room to state the invariant. Something like calculate_hal_export_parameter_requirements versus calculate_native_kernarg_export_parameter_requirements would make accidental cross-use much harder.

2. Custom-direct validation accepts truly short argument buffers for the common no-implicit-suffix case

iree_hal_amdgpu_executable_initialize_dispatch_descriptor leaves custom_kernarg_layout as all-zero unless an implicit suffix is detected:

Both planners then use custom_kernarg_layout.explicit_kernarg_size as the required minimum:

When there is no implicit suffix, that required size is zero. So the relaxed validation accepts any custom direct blob length, including one that is actually shorter than the declared kernel kernarg segment. The writer then uses the caller's length as the dynamic layout size:

That is looser than the intended "allow missing trailing ABI padding" rule. It can accept a genuinely short prepacked buffer and dispatch with missing argument bytes.

Suggested fix: even for dynamic custom-direct layouts, preserve a validated minimum explicit byte count from metadata/symbol information. The dispatch-time rule should be roughly:

iree_host_align(constants.data_length, 8) >= layout->explicit_kernarg_size

Then the writer may clamp extra trailing bytes and zero missing padding up to the reservation. The dynamic part can still mean "the exact reservation may be caller-sized when no implicit suffix exists", but the minimum legal byte count should not be zero unless the kernel really has a zero-byte kernarg ABI.

3. Host-queue custom-direct block count truncates before the safety check

The host queue path computes a runtime block count from constants.data_length and immediately casts it to uint32_t:

The ring-capacity check later sees only the truncated value:

But the actual emplace path still memset/memcpys the full constants.data_length:

So an oversized iree_host_size_t length can wrap to a small block count, pass validation, reserve too few queue kernarg blocks, and then overrun the queue-owned kernarg allocation. This is security/correctness-sensitive because the direct host-queue path is not protected by the command-buffer qword length guards.

Suggested fix: keep the ceil-div result in iree_host_size_t, reject values greater than UINT32_MAX, then cast. It would be worth sharing a tiny checked helper with executable descriptor initialization so every path that converts kernarg byte length to block count has the same overflow behavior.

4. ELF symbol augmentation can read past malformed section bounds

The new synthetic kernel augmentation scans ELF symbol sections, but the appended-symbol pass checks only that symbol_section_offset <= elf_data.data_length:

It does not check symbol_section_size <= elf_data.data_length - symbol_section_offset. The first thing iree_hal_amdgpu_hsaco_metadata_is_elf_kernel_symbol does is read symbol[4]:

The second pass repeats the same unchecked range pattern:

So malformed ELF section metadata can produce out-of-bounds symbol reads before the later symbol-name helper gets a chance to reject the symbol.

Suggested fix: use the full range check used elsewhere in this file before deriving symbol_count, including multiplication/addition overflow checks for symbol_count * symbol_entry_size. If the section is malformed, skip it or fail loudly, but the code should never form a symbol pointer outside the ELF byte span.

5. append_elf_kernel_symbols leaks expansion allocations on failure

After the expansion storage is allocated:

a failure allocating extra_symbol_name_storage leaks expanded.kernels/args:

Later IREE_RETURN_IF_ERROR exits during the second scan can leak both allocations, because the expanded buffers have not yet been installed into metadata and the caller's cleanup only sees the original metadata object:

Suggested fix: make the expansion path a status-gated block with one terminal cleanup section. Until ownership is committed into metadata, expanded.kernels and expanded.owned_string_storage need local cleanup on every error path.

Larger design concern: synthetic ELF kernels need to be quarantined/opted in

This helper now invents metadata for ELF function symbols that do not have metadata entries by picking one template kernel and copying its ABI/workgroup metadata:

That is a very strong producer-specific assumption. It may match a Tensile/hipBLASLt packaging pattern, but it is not generally true for arbitrary raw HSACO: a code object can contain multiple kernels with different argument layouts, required workgroup sizes, group/private segment sizes, or hidden suffix behavior. Publishing synthetic exports with copied metadata makes those kernels look valid until dispatch, where the failure mode is wrong argument layout rather than a clean load-time error.

Suggested fix: quarantine this behind an explicit opt-in. A few possible shapes:

  • Put the synthetic-kernel expansion behind a raw-HSACO executable flag/option that names the producer assumption, instead of doing it unconditionally in iree_hal_amdgpu_hsaco_metadata_initialize_from_elf.
  • Restrict it to a narrow, validated pattern: exactly one metadata template kernel, extra ELF kernel symbols known to share the same ABI by producer contract, and ideally an executable parameter that says HRX is intentionally loading that bundle form.
  • Keep the generic parser honest: by default, only kernels with actual AMDGPU metadata should become dispatchable exports. Missing metadata should mean "not reflectable/not dispatchable by this generic path", not "borrow an arbitrary template".

If this expansion stays, tests should include at least one negative case with two real metadata kernels of different ABI plus one extra symbol, so the loader proves it does not silently clone the wrong contract.

@benvanik
Copy link
Copy Markdown
Collaborator

benvanik commented Jun 2, 2026

(my prompt to gpt5.5-xhigh was: › github PR review time! https://github.com/ROCm/hrx-system/pull/12/ - I've already done a first pass and think it looks better, but I'd like you to see if there's any other risks in here (I found some performance ones in our critical dispatch path, and safety is always an issue with ensuring proper iree_status_t tracking/etc) - I recommend doing the same - the concerns identified above are all valid)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants